Abstract

With the prevalence of commodity depth cameras, a new paradigm of user
interfaces based on 3D motion capture and recognition has dramatically
changed the way humans interact with computers. Human action
recognition, as one of the key components of these devices, plays an
important role in guaranteeing the quality of the user experience.
Although model-driven methods have achieved huge success, they cannot
provide a scalable solution for efficiently storing, retrieving and
recognizing actions in large-scale applications. These models are also
vulnerable to temporal translation and warping, as well as to variations
in motion scales and execution rates. To address these challenges, we
propose to treat 3D human action recognition as a video-level hashing
problem and propose a novel First-Take-All (FTA) Hashing algorithm
capable of hashing an entire video into hash codes of fixed length. We
demonstrate that the FTA algorithm produces a compact representation of
the video that is invariant to the above-mentioned variations, through
which action recognition can be solved by an efficient nearest neighbor
search using the Hamming distance between the FTA hash codes.
Experiments on public 3D human action datasets show that the FTA
algorithm can reach a recognition accuracy higher than 80%, with about
15 bits per frame considering there are 65 frames per video over the
datasets.


1. Introduction 

The recent advances in commodity depth sensors such as Microsoft
Kinect, Intel RealSense and LeapMotion have dramatically changed the
way of human-computer interaction. The new generation of user
interfaces based on 3D motion capture and recognition makes the
interactions between humans and computers easier than ever before.
These interfaces have already enabled a wide range of applications
including video games, education, business and healthcare. Behind all
these applications, 3D human action recognition plays a key role and
directly determines the quality of the user experience.

Although a great number of works [10, 17, 11, 21, 16, 1! ] have been
developed for solving the problem of automatic human action
recognition, modeling the dynamic structures of human actions remains
challenging due to the temporal translation and warping of the action
sequences, as well as the variation in the motion scales and the
execution rates of the actions [26]. More importantly, the current
model-driven solutions normally require a dedicated classifier for each
class of actions and cannot provide a scalable solution for large-scale
action recognition applications.

Inspired by the success of hashing techniques in image retrieval [3],
we treat 3D human action recognition as a hashing problem of encoding
videos with compact binary sequences of fixed length, so that the
similarity between videos can be compared by the Hamming distance
between their hash codes while preserving the intrinsic temporal
structure of actions. Thus, action recognition can be solved by an
efficient approximate nearest neighbor search based on the hash codes
of videos. Most of the existing hashing algorithms [3, 2, 1, 8],
however, are developed for images with a fixed resolution/dimension.

We note that there exists a hashing algorithm [23] that handles
large-scale video datasets while considering temporal consistency.
However, this method still applies the hashing to each individual frame
rather than to the entire sequence. Hence, to measure the video
similarity, we have to compute the average similarity between each pair
of frames, which can be computationally prohibitive, especially with
the ever-growing length of videos in many applications. To address this
problem, we propose to hash the entire video as a whole into a bit
sequence of fixed length to facilitate direct computation of the
Hamming distance. There are two challenges facing the hashing of the
entire video: (1) encoding the temporal structures of actions and (2)
dealing with the varying length of videos to generate hash codes of
fixed length. Ideally, the generated hash codes should be resilient
against temporal translation and warping, and against variations in
scales and execution rates.

To address the above challenges, we propose a novel temporal
order-preserving hashing algorithm, namely First-Take-All (FTA). The
FTA hashing algorithm first applies multiple random projections to
translate a video into several sequences of latent postures. Then it
encodes the video by the temporal order of the occurrence of these
postures. Specifically, in each iteration that generates a new hash
code, a group of k latent postures is randomly selected, and the video
is encoded by the index of the posture that is acted first. After
several iterations, the set of hash codes generated in this FTA fashion
captures the temporal order of the latent postures acted in a video,
and the similarity between videos can be measured by computing the
Hamming distance between the FTA hash codes. Since the hashing is
applied to the entire video, it can normally achieve a low bit rate per
frame, making it much more efficient than the model-based approaches.
In addition, we will show that the FTA hashing is invariant to temporal
translation and warping, as well as to variations in motion scales and
execution rates, as long as the temporal order of the posture sequence
does not change for a class of actions. This makes FTA robust against
the intra- and inter-class variations caused by individual actors.

The FTA hashing algorithm extends the Winner-Take-All (WTA) algorithm
[20], but differs from it in several significant aspects. First of all,
WTA cannot be applied to hashing sequences of varied length. In fact,
it must assume that all the input vectors reside in a feature space of
fixed dimension. For this reason, WTA has only been applied to hash
data of fixed length, like images and text. Second, WTA compares the
order of features chosen from the original space. This unnecessarily
limits its ability to capture the ranking structure among various
subspaces. In contrast, the proposed FTA hashing is more expressive in
representing the temporal order of latent postures obtained by
projecting the sequence into several subspaces.

The main contributions of the paper are: 

1. We propose a novel FTA algorithm for hashing videos of varied
length. The hashing algorithm is invariant to temporal translation,
scale variation and execution rate variation; and

2. We perform extensive experimental studies on three public 3D human
action datasets and demonstrate the performance of the proposed FTA
hashing algorithm by comparing it with a baseline method that does not
leverage the temporal-order information.

To the best of our knowledge, this is the first work to perform hashing
on the entire video sequence of varied length for the recognition of
human actions. It is also worth noting that the proposed FTA hashing is
not limited to human action videos; it can potentially be applied as a
generic temporal hashing algorithm to other types of video sequences.

The remainder of the paper is organized as follows. Section 2 briefly
reviews the related work. The First-Take-All hashing algorithm is
introduced and discussed in Section 3. Experiments and a performance
study are presented in Section 4. Finally, Section 5 concludes the
paper.

2. Related work 

We review the related work in the following two categories.

Human Action Recognition and Retrieval 

Modeling the temporal structure of video sequences is one of the most
challenging problems in human action recognition and has attracted
intensive research. Many of the existing approaches focus on extracting
local spatio-temporal features and do not explicitly model the temporal
patterns of the action sequence. Most of these works are
histogram-based and adopt the bag-of-words framework. For example, in
[7], a bag of 3D points from the depth maps is sampled and clustered to
model the dynamics of human actions. Similar ideas are presented in
[22], where Histogram of Gradient (HoG) features are extracted from the
depth motion maps to classify human actions. The histogram of 3D joints
(HOJ3D) [18] and the histogram of visual words [27] are also employed
to describe the action sequences by using the joint features.
Unfortunately, these histogram-based methods do not preserve the
temporal order of the primitive postures of the action and may perform
poorly when distinguishing between actions composed of similar postures
in different temporal orders.

The temporal characteristics of human actions must be fully explored in
order to achieve a high recognition rate. Motion template-based
approaches [10, 26] are introduced to model the temporal dynamics of
actions, where the Dynamic Time Warping (DTW) algorithm is used to
align sequences of varied length and execution rate. In contrast, the
Temporal Pyramid [17, 1 ] attempts to represent the temporal structure
of the sequence by uniformly subdividing the sequence into several
partitions. With the uniform temporal partition, however, the temporal
pyramid is less flexible in handling execution rate variation. The
Adaptive Temporal Pyramid [2 ] is instead proposed to overcome this
problem by adaptively dividing the temporal sequence according to the
motion energy. Vemulapalli et al. [ 1 ] present a body-part
representation of the human skeleton, where the temporal dynamics in
terms of 3D transformations are projected onto a curved manifold in the
Lie group.

Hashing Algorithms 

Hashing algorithms are widely adopted in the approximate nearest
neighbor search problem [3, 2]. Many works aim at achieving a higher
retrieval rate with a shorter code length. We categorize the existing
hashing algorithms into two families: the space-partitioning methods
and the ranking-based methods.

The space-partitioning methods, such as Locality Sensitive Hashing
(LSH) [3] and Compressed Hashing [8], normally partition the whole
feature space into a sequence of half spaces and quantize the original
features into binary bits by these half spaces. In order to preserve
the Euclidean distances between high-dimensional vectors with
sufficient precision, long codes are usually required for these
methods. To overcome this drawback, Multi-Probe LSH [9] and
entropy-based LSH [1 ] are proposed to reduce the storage burden at the
cost of increasing the query time. On the other hand, different
paradigms of hashing algorithms have been developed to approximate
distance metrics other than the Euclidean distance, such as the p-norm
distance and the Mahalanobis distance [5]. Spherical LSH [14] works for
hashing a set of points on the hypersphere of an input space. There
also exist several works kernelizing the LSH approaches by considering
the Reproducing Kernel Hilbert Space (RKHS) [4, 13].

Unlike space-partitioning methods, the ranking-based hashing methods
encode the ordinal relation between the original features rather than
their magnitudes. For example, Min Hashing [1] approximates the Jaccard
similarity coefficient between two sets by encoding a set with the
minimal value of a hash function over its members. Recently, the
Winner-Take-All (WTA) Hash [20] has been proposed to encode the
magnitude orders of randomly permuted features. The resultant hash
codes are scale-invariant, and are often more resilient against noise.
Some other ranking-based hashing algorithms, like the Rank-Sensitive
Hash [15], can be regarded as a special case of WTA in which the window
size of the ordinal comparison is 2.
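The ranking-based idea can be made concrete with a minimal Python sketch of WTA-style hashing. It is written from the description above rather than taken from [20]; the function name `wta_hash`, the `window` parameter (the number of permuted features compared per code) and the seeded generator are illustrative assumptions.

```python
import random

def wta_hash(x, num_codes, window, seed=0):
    """WTA-style codes: for each random permutation, the position of the
    largest value among the first `window` permuted features."""
    rng = random.Random(seed)          # same seed => same permutations
    codes = []
    for _ in range(num_codes):
        perm = rng.sample(range(len(x)), len(x))   # a random permutation
        head = [x[i] for i in perm[:window]]
        codes.append(head.index(max(head)))        # 0-based winner position
    return codes

x = [0.3, 1.7, 0.9, 2.4, 0.1, 1.2]        # toy feature vector (made up)
scaled = [3.0 * v + 1.0 for v in x]       # order-preserving rescaling
same = wta_hash(x, 8, 3) == wta_hash(scaled, 8, 3)
```

Because only the ordering of the compared features matters, `same` is true for any order-preserving rescaling of `x`. Note that the input must have a fixed dimension, which is the limitation FTA is designed to remove.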

However, to the best of our knowledge, there exists no hashing
algorithm that aims at encoding the temporal structure of entire
sequences of varied length. Ye et al. [23] introduce a supervised
hashing algorithm for video sequences. However, it still performs
hashing on individual frames, rather than on the entire video. The
video similarity then has to be computed by the average Hamming
distance between each pair of frames. In contrast, we attempt to expand
the scope of ranking-based hashing methods to directly explore the
temporal order of actions at the video level. This can yield much more
compact hash codes for videos, and the similarity between videos can be
directly computed by the Hamming distance between the video-level hash
codes.

3. Temporal Order-Preserving Hashing 

In order to hash human action sequences into binary bits while
preserving their temporal order, we apply a random projection to a
video. This generates a sequence of confidence scores measuring whether
each frame of the video belongs to an (unlabeled) posture corresponding
to the random projection. We generate several random projections that
are applied to a video, and encode the video with the index of the
random projection whose confidence score peaks first. This is why our
algorithm is named First-Take-All (FTA).

This process is repeated several times, and a sequence of hash codes is
generated for a video. In this fashion, the temporal order in which the
postures are performed in a video is encoded. In Section 3.1, we
formalize the algorithm and explain the intuition behind the formal
description.

3.1. Formal Description 

In this section, we formally present the proposed temporal
order-preserving hashing algorithm for video sequences, First-Take-All
(FTA).

Random Projections for Latent Postures 

The method can be formulated as follows. Suppose a video of length $n$
is represented as a sequence of frames $X = [x_1, x_2, \dots, x_n]$.
Each frame $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$, is represented by
a $d$-dimensional feature vector. First, we generate $m$ random
projections $W = [w_1, w_2, \dots, w_m]$, where each
$w_l \in \mathbb{R}^d$ is drawn from a multivariate Gaussian
distribution, i.e., $w_l \sim \mathcal{N}(0, \sigma^2 I)$ for
$l = 1, \dots, m$.

As mentioned above, each random projection can be interpreted as
forming a linear subspace for an unknown posture. Hence, the inner
product $S_{l,i} = w_l^T x_i$ represents the confidence score of
posture $l$ on frame $i$. In a compact matrix form, we can use a matrix
of size $m \times n$,

$$S \triangleq [S_{l,i}]_{m \times n} = W^T X,$$

each row $l$ of which represents the application of random projection
$w_l$ to the entire video sequence.
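As a minimal sketch of this projection step, the following Python code draws the projections and builds the score matrix using only the standard library. The function names, the toy frames and the seed are illustrative assumptions; `W` is stored as $m$ rows of length $d$, i.e., as $W^T$ in the notation above.

```python
import random

def random_projections(m, d, sigma=1.0, seed=0):
    """Draw m projection vectors w_l ~ N(0, sigma^2 I) in R^d."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(m)]

def score_matrix(W, X):
    """Return S with S[l][i] = <w_l, x_i>.
    W: m projection vectors of length d; X: n frames of length d."""
    return [[sum(wc * xc for wc, xc in zip(w, x)) for x in X] for w in W]

# Toy video with n = 4 frames and d = 3 features per frame (made-up values).
X = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
W = random_projections(m=5, d=3)
S = score_matrix(W, X)   # 5 rows (latent postures) by 4 columns (frames)
```

The number of rows of `S` depends only on $m$, while the number of columns follows the video length $n$; the FTA step described next removes the dependence on $n$.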

First-Take-All 

Our goal is to design a hash coding mechanism that preserves the
temporal order structure. With the posture sequence in each row of the
resultant matrix $S$, we can find which posture is performed first and
use it to encode the video.



Figure 1: Illustration of the FTA by peak approach when $k = 3$.

Formally, we randomly choose $k$ rows indexed by
$\mathcal{K} = \{l_1, \dots, l_k\}$ from $S$, each row representing the
sequence of a latent posture. When $k = 2$, this is a pairwise
comparison of the temporal order between two postures. When $k > 2$,
the comparison is made among multiple postures.

To decide which of the postures comes first, we need to test when each
posture is performed for the first time in the sequence. Here we
introduce two approaches.

FTA by peak. The first approach is to use the peak of the confidence
score to denote the first time a posture is acted. Suppose for posture
$l_j \in \mathcal{K}$, its peak is attained at frame $i_{l_j}$, i.e.,

$$i_{l_j} = \arg\max_{1 \le i \le n} \{ S_{l_j,i} \mid S_{l_j,i} > \theta \}.$$

So $i_{l_j}$ denotes the time when the confidence that posture $l_j$ is
performed reaches its peak. Note that here we require the confidence
score to be larger than a threshold $\theta$ for a frame $i$ to be
considered as posture $l_j$. This is used to rule out less salient
postures and to provide a more robust result for the following temporal
order comparison. If the confidence score for posture $l_j$ never
passes this threshold, it implies that the posture might never have
been performed. In this case, we set $i_{l_j}$ to $+\infty$ by
convention.

Then the posture that comes first is given by

$$j^* = \arg\min_{1 \le j \le k} i_{l_j}, \quad \text{s.t. } l_j \in \mathcal{K}, \qquad (1)$$

where $\mathcal{K}$ is the set of rows selected for the current
iteration. We use $j^*$ as the hash code for the video, which denotes
the index of the posture first performed in the sequence. If all
$i_{l_j}$, $l_j \in \mathcal{K}$, are $+\infty$ (i.e., none of the
postures has been performed), the above temporal-order comparison
fails, and we output a special code 0 to denote this case.

We illustrate an FTA by peak example in Figure 1, where we plot three
posture sequences $S_{l_1}, S_{l_2}, S_{l_3}$, each corresponding to a
row of the matrix $S$. For simplicity, we assume that the peaks of all
three sequences pass the preset threshold $\theta$. If we set $k = 3$,
all three sequences are involved; the peak of $S_{l_3}$ comes first and
thus the current iteration produces the hash code 3. Otherwise, if we
set $k = 2$, a pair of sequences is randomly chosen. Suppose we choose
$\mathcal{K} = \{l_1, l_2\}$. Because $i_{l_2} < i_{l_1}$, we output
$j^* = 2$ to encode that the peak of posture sequence $l_2$ comes
first.
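The FTA by peak rule and the example above can be sketched in a few lines of Python. The score values below are made up so that the peaks occur in the order $l_3$, $l_2$, $l_1$; the function names and the tie-breaking choice (the earliest frame and the first listed posture win ties) are assumptions, since ties are not discussed in the text.

```python
import math

def peak_time(row, theta):
    """1-based frame of the highest score above theta; math.inf if none."""
    best_frame, best_score = math.inf, theta
    for frame, score in enumerate(row, start=1):
        if score > best_score:        # strict '>' keeps the earliest peak on ties
            best_frame, best_score = frame, score
    return best_frame

def fta_code(times):
    """Eq. (1): 1-based position of the earliest time, or 0 if none is finite."""
    earliest = min(times)
    return 0 if earliest == math.inf else times.index(earliest) + 1

theta = 0.5
S_l1 = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]   # peak at frame 5
S_l2 = [0.2, 0.4, 0.8, 0.5, 0.2, 0.1]   # peak at frame 3
S_l3 = [0.3, 0.9, 0.4, 0.2, 0.1, 0.0]   # peak at frame 2

code_k3 = fta_code([peak_time(r, theta) for r in (S_l1, S_l2, S_l3)])  # 3
code_k2 = fta_code([peak_time(r, theta) for r in (S_l1, S_l2)])        # 2
```

With all three rows the code is 3, and with only $l_1$ and $l_2$ it is 2, matching the two cases described for Figure 1. Raising `theta` above every score makes all times infinite and yields the special code 0.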

FTA by peak is a rather conservative approach to deciding the temporal
order between the postures: a posture has usually been acted for a
while well before its peak. In contrast to this conservative approach,
we introduce a more aggressive approach below.

FTA by thresholding. In this approach, we assume that the first time a
posture is performed in a sequence is when its confidence score reaches
the preset threshold $\theta$ for the first time. Formally, we have

$$i_{l_j} = \min \{ i \mid S_{l_j,i} > \theta \},$$

where $l_j \in \mathcal{K}$ belongs to the selected postures. By
convention, we set $i_{l_j}$ to $+\infty$ if the above set is empty,
denoting that posture $l_j$ has never been acted in the sequence.
Accordingly, the index of the posture first performed is also
determined by Eq. (1).
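To illustrate how the two definitions of the first-acting time can disagree, here is a small Python sketch with made-up scores. The helper names are hypothetical, and the strict comparison with the threshold follows the formula above.

```python
import math

def onset_time(row, theta):
    """FTA by thresholding: first 1-based frame whose score exceeds theta."""
    for frame, score in enumerate(row, start=1):
        if score > theta:
            return frame
    return math.inf

def peak_time(row, theta):
    """FTA by peak: 1-based frame of the maximum score, if it exceeds theta."""
    top = max(row)
    return row.index(top) + 1 if top > theta else math.inf

def fta_code(times):
    """Eq. (1): 1-based position of the earliest time, or 0 if none is finite."""
    earliest = min(times)
    return 0 if earliest == math.inf else times.index(earliest) + 1

theta = 0.5
slow_rise = [0.6, 0.7, 0.9, 0.2]     # passes theta at frame 1, peaks at frame 3
sharp     = [0.1, 0.95, 0.3, 0.1]    # passes theta and peaks at frame 2

by_threshold = fta_code([onset_time(r, theta) for r in (slow_rise, sharp)])  # 1
by_peak      = fta_code([peak_time(r, theta) for r in (slow_rise, sharp)])   # 2
```

The slowly rising posture wins under thresholding because it crosses the threshold early, while the sharply peaked posture wins under the peak rule. This is the sense in which thresholding is the more aggressive of the two.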

In the extreme case that no posture from $\mathcal{K}$ passes the
threshold test, all $i_{l_j}$ would be $+\infty$ for
$l_j \in \mathcal{K}$. Then we produce the special hash code 0 to
encode the video. Therefore, in each iteration, we have a code book
$\{0, 1, \dots, k\}$ with $k + 1$ entries for the encoding of a video.
A $(k+1)$-ary code can be represented by
$\lceil \log_2 (k+1) \rceil$ binary bits. We repeat the above coding
iteration $p$ times, and obtain $p$ $(k+1)$-ary codes, or equivalently
$p \lceil \log_2 (k+1) \rceil$ binary bits, to encode an input
sequence.
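The code size can be checked with a short Python sketch, assuming base-2 logarithms. The helper names are hypothetical; the computation uses the identity that $\lceil \log_2 (k+1) \rceil$ equals the bit length of the integer $k$ for $k \ge 1$, which avoids floating-point rounding.

```python
def bits_per_code(k):
    """Binary bits needed for one (k+1)-ary code, i.e. ceil(log2(k + 1))."""
    # For every integer k >= 1, ceil(log2(k + 1)) == k.bit_length().
    return k.bit_length()

def total_bits(p, k):
    """Binary bits needed for p (k+1)-ary codes."""
    return p * bits_per_code(k)

sizes = {k: bits_per_code(k) for k in (2, 3, 4, 7, 8)}
# k = 2 and k = 3 both need 2 bits; k = 4 to 7 need 3 bits; k = 8 needs 4 bits.
```

For example, `total_bits(1000, 2)` gives 2000 bits for a video encoded with $p = 1000$ codes and $k = 2$, regardless of the number of frames.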

An algorithmic overview of our method is shown in Algorithm 1.


Algorithm 1 FTA Hashing

procedure FTA($m$, $k$, $p$, $W$, $X$, $\theta$)
    Generate $W$ from the Gaussian distribution
    Set $S \leftarrow W^T X$
    Initialize $b$ as an empty code sequence
    for $r = 1$ to $p$ do
        Randomly select $k$ rows $\mathcal{K}$ from $S$
        for $j = 1$ to $k$ do
            Compute the first-acting time $i_{l_j}$ for $l_j \in \mathcal{K}$
        end for
        Set $j^* \leftarrow \arg\min_j i_{l_j}$, s.t. $l_j \in \mathcal{K}$
            (set $j^* \leftarrow 0$ if every $i_{l_j} = +\infty$)
        $b \leftarrow b \cup \{j^*\}$
    end for
    return $b$
end procedure
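Algorithm 1 can be sketched end to end in standard-library Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the function names, the `mode` switch, the default `sigma` and the use of a shared `seed` to fix both the projections and the row selections are assumptions. To compare two videos, both must be hashed with the same projections and the same row selections, which the shared seed provides as long as $d$, $m$, $k$ and $p$ are unchanged.

```python
import math
import random

def first_time(row, theta, mode):
    """First-acting time of one latent posture (1-based), or math.inf."""
    if mode == "peak":
        top = max(row)
        return row.index(top) + 1 if top > theta else math.inf
    for frame, score in enumerate(row, start=1):   # mode == "threshold"
        if score > theta:
            return frame
    return math.inf

def fta_hash(X, m, k, p, theta, mode="peak", sigma=1.0, seed=0):
    """Hash a video X (list of n frames, each a list of d floats)
    into p codes from the code book {0, 1, ..., k}."""
    rng = random.Random(seed)
    d = len(X[0])
    # m random projections w_l ~ N(0, sigma^2 I), stored as rows.
    W = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(m)]
    # Score matrix S[l][i] = <w_l, x_i>.
    S = [[sum(a * b for a, b in zip(w, x)) for x in X] for w in W]
    codes = []
    for _ in range(p):
        rows = rng.sample(range(m), k)             # k distinct postures
        times = [first_time(S[l], theta, mode) for l in rows]
        earliest = min(times)
        codes.append(0 if earliest == math.inf else times.index(earliest) + 1)
    return codes

# Toy video: 20 frames with 3 made-up features per frame.
video = [[math.sin(0.3 * t), math.cos(0.2 * t), 0.1 * t] for t in range(20)]
codes = fta_hash(video, m=16, k=2, p=10, theta=0.0)
shorter = fta_hash(video[:12], m=16, k=2, p=10, theta=0.0)
```

Both `codes` and `shorter` contain exactly $p = 10$ symbols even though the two inputs have different numbers of frames, which is what allows a direct Hamming comparison between videos of varied length.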


The complexity of computing one hash code is $O(k \log n + \log k)$, so
the total complexity for $p$ hash codes, including the cost of the
random projection, is $O(p(k \log n + \log k) + mdn)$. Considering
$k \ll n$, the total complexity can be estimated as
$O(pk \log n + mdn)$, which is linear with respect to the input
arguments.

3.2. The Invariance Properties of FTA 

As mentioned in Section 1, it is very challenging to model the temporal
characteristics of action videos due to temporal translation and
warping as well as the variation in motion scales and execution rates.
In this subsection, we show that the FTA hashing is insensitive to the
above variations owing to its temporal order-preserving nature.

Figure 2 illustrates some running examples of these properties. We
discuss the examples using the FTA by peak version and set $k$ to 2.
The results are equally applicable to FTA by thresholding.

Following the notation in Section 3.1, we use $X$ and $X'$ to denote
two video sequences of the same action class. Two postures $l_1$ and
$l_2$ are obtained by projecting both video sequences into the
corresponding subspaces, and we apply FTA by peak to encode the
temporal order of these postures. We use $S_{l_1}, S_{l_2}$ and
$S'_{l_1}, S'_{l_2}$ to denote the confidence scores of postures $l_1$
and $l_2$ on $X$ and $X'$, respectively. The peaks occur at times
$i_{l_1}, i_{l_2}$ for video $X$, and $i'_{l_1}, i'_{l_2}$ for video
$X'$.

Temporal Translation Invariance. In Figure 2(a), although the peaks of
postures $l_1$ (red) and $l_2$ (blue) are at different locations for
$X$ and $X'$ due to the temporal translation, the relationships
$i_{l_1} < i_{l_2}$ and $i'_{l_1} < i'_{l_2}$ are consistent for these
two videos of the same class. The FTA hashing produces the code 1 for
both videos.

Motion Scale Invariance. In Figure 2(b), although $S_{l_1}, S_{l_2}$
(solid lines) have a larger scale than $S'_{l_1}, S'_{l_2}$ (dashed
lines), their peak order remains the same and the FTA hashing produces
the same hash code for the videos of the same class.

Execution Rate Invariance. In Figure 2(c), $S_{l_1}$ and $S_{l_2}$
(solid lines) are squeezed due to the execution rate variation; for
example, some people perform the action faster than others. However,
the orders $i_{l_1} < i_{l_2}$ and $i'_{l_1} < i'_{l_2}$ are not
affected by this variation, and the FTA hashing produces the same code
for $X$ and $X'$.
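These three properties can be checked numerically with a small Python sketch. The bell-shaped score curves, their centers and widths, and the threshold value are made-up stand-ins for the confidence scores of two latent postures; only the FTA by peak rule with $k = 2$ follows the text.

```python
import math

def bump(n, center, width, height=1.0):
    """Synthetic confidence scores: a bell-shaped curve over frames 1..n."""
    return [height * math.exp(-((i - center) / width) ** 2)
            for i in range(1, n + 1)]

def fta_peak_code(rows, theta):
    """FTA by peak over the given rows; 0 if no peak exceeds theta."""
    times = []
    for row in rows:
        top = max(row)
        times.append(row.index(top) + 1 if top > theta else math.inf)
    earliest = min(times)
    return 0 if earliest == math.inf else times.index(earliest) + 1

theta = 0.2
original   = [bump(60, 15, 5), bump(60, 35, 5)]          # l_1 peaks before l_2
translated = [bump(60, 25, 5), bump(60, 45, 5)]          # shifted by 10 frames
rescaled   = [[0.5 * s for s in row] for row in original]  # half the scale
faster     = [bump(30, 8, 2.5), bump(30, 18, 2.5)]       # about twice as fast
reordered  = [bump(60, 35, 5), bump(60, 15, 5)]          # postures swapped

codes = [fta_peak_code(v, theta)
         for v in (original, translated, rescaled, faster)]
```

All four entries of `codes` equal 1, while `fta_peak_code(reordered, theta)` gives 2. The scale case holds only while the rescaled peaks remain above the threshold; with `theta = 0.6` the rescaled video would produce the special code 0.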

4. Experiments 

In this section, we present the experimental results on several 3D
action video datasets.

4.1. Baseline Method 

As performing hashing at the video level is a new problem, to the best
of our knowledge there is no existing method in the literature that can
serve as the baseline. Thus we introduce a "Bag-of-Words (BOW)" style
method as the baseline for comparison. The method also applies random
projections to each frame to produce sequences of latent postures.
However, it views each video as a bag and each posture as an item in
the bag. In other words, it does not consider any temporal order
between the postures. Specifically, the BOW algorithm performs a
thresholding test to decide whether an item of posture $l$ exists in
the video, i.e., the hash code $h_l$ is given as

$$h_l = \begin{cases} 0, & \text{if } \max_{1 \le i \le n} S_{l,i} < \theta, \\ 1, & \text{otherwise,} \end{cases}$$

where $\theta$ is a threshold to detect the existence of posture $l$
according to its confidence score $S_{l,i}$ in the sequence. The above
process is iterated to generate a sequence of binary hash bits. Since
the BOW method does not leverage any temporal information, it can serve
as a baseline method to validate the advantage of the temporal
order-preserving FTA hashing algorithm.
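A minimal Python sketch of this baseline makes its blindness to temporal order explicit. The function name `bow_bits` and the toy score matrix are assumptions; each row holds the scores of one latent posture over the frames of a video.

```python
def bow_bits(S, theta):
    """One bit per latent posture: 0 if its maximum score is below theta,
    1 otherwise."""
    return [0 if max(row) < theta else 1 for row in S]

# Toy score matrix: 3 latent postures over 4 frames (made-up values).
forward = [[0.1, 0.9, 0.2, 0.1],
           [0.1, 0.2, 0.8, 0.1],
           [0.1, 0.2, 0.3, 0.2]]
# The same postures performed in reverse temporal order.
backward = [row[::-1] for row in forward]

bits_forward = bow_bits(forward, theta=0.5)    # [1, 1, 0]
bits_backward = bow_bits(backward, theta=0.5)  # [1, 1, 0]
```

Reversing the video leaves the BOW bits unchanged, whereas an FTA code over the first two postures would flip, because the posture that peaks first is different in the two videos.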

4.2. Feature Extraction 

Since feature extraction is not the contribution of the current paper,
we adopt the following four types of features commonly used in 3D human
action recognition tasks.

1. Pairwise-joint distance (PJD): the normalized distance between a
pair of joints [17, 26];

2. Joint offset feature (JO): the normalized joint offset between two
consecutive frames [25];

3. Pairwise-angle feature (PA): the cosine of the angle between a pair
of body segments [24]; and

4. Histogram of Velocity Components (HVC): the histogram of the 3D
velocity of the point cloud in the neighborhood of the joints [24].

4.3. Experiment Setting 

The parameter $\theta$ for BOW and for both versions of FTA is chosen
by 5-fold cross validation on the training set. Considering that the
proposed hashing algorithm is based on random projection and random
selection of postures, we repeat each experiment for 50 runs and report
the average accuracy in all experiments. The Hamming distance is used
as the distance metric for the KNN search to predict the label of an
unknown test sequence.
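The prediction step can be sketched as a brute-force KNN search in Python. The names `hamming` and `knn_predict`, the parameter `k_neighbors` (named to avoid confusion with the FTA parameter $k$), the toy codes and the action labels are all hypothetical. The distance below counts differing $(k+1)$-ary symbols; whether the distance is computed on symbols or on their binary encodings is not specified in the text, so this is an assumption.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length code sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, train_codes, train_labels, k_neighbors=1):
    """Majority label among the k_neighbors training codes closest to query."""
    order = sorted(range(len(train_codes)),
                   key=lambda i: hamming(query, train_codes[i]))
    votes = Counter(train_labels[i] for i in order[:k_neighbors])
    return votes.most_common(1)[0][0]

# Toy FTA codes with p = 4 and k = 2, so symbols come from {0, 1, 2}.
train_codes = [[1, 2, 1, 0], [1, 2, 2, 0], [2, 1, 2, 1]]
train_labels = ["wave", "wave", "kick"]
query = [2, 1, 2, 0]

nearest = knn_predict(query, train_codes, train_labels, k_neighbors=1)
majority = knn_predict(query, train_codes, train_labels, k_neighbors=3)
```

Here the distances from `query` to the three training codes are 3, 2 and 1, so the single nearest neighbor votes for "kick" while a 3-neighbor vote returns "wave".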

4.4. Experiment Results 

We conduct performance evaluations on three widely used public 3D
action video datasets. It is worth noting that the proposed FTA hashing
algorithm is especially amenable to large-scale tasks; however, to the
best of our knowledge, there is no extremely large 3D action dataset
publicly available in the literature. Nevertheless, the results on
these datasets should suffice to show the competitive performance of
the proposed algorithm.



Figure 2: Running examples to demonstrate the invariance properties of
the FTA hashing: (a) translation invariance, (b) scale invariance, (c)
execution rate invariance. Two postures $l_1$, $l_2$ (red and blue) are
investigated on two videos $X$, $X'$ (solid and dashed) of the same
action class.


UTKinect-Action Dataset 

The UTKinect-Action dataset [19] consists of 10 action types performed
by 10 subjects. All subjects perform each action twice. Since subjects
are free to move in the environment, the dataset is very challenging
due to the large viewpoint variation and intra-class variance. We
follow the same cross-subject test setting as [1 ].

Experimental results are summarized in Table 1, in which we compare the
recognition accuracy of BOW, FTA by peak and FTA by thresholding with
the four types of features. The $(k+1)$-ary code length $p$ is set to
1,000 and $k$ is set to 2. FTA by peak achieves an accuracy of 90.20%
and 86.57% on the PA feature and the PJD feature, respectively. This is
an impressive result considering that we use only hashing and a nearest
neighbor search by the Hamming distance. As shown in Table 1, both FTA
by peak and FTA by thresholding outperform the baseline BOW method by
roughly 6 to 12 percentage points on the PJD, PA and HVC features,
while the three methods are within about 2.5 points of each other on
the JO feature. This demonstrates the contribution of modeling the
temporal order to the performance of the FTA hashing. We also note that
FTA by peak performs better than FTA by thresholding. This is probably
because the peak is a more robust estimate of the occurrence time of a
posture than the onset time at which the score passes a confidence
threshold.

It is also worth noting that the performance varies across different
features. The PA feature achieves an accuracy of 90.2% while the HVC
feature reaches only 69.6%. This is expected because different features
have different discriminative capabilities. We report the accuracy on
multiple features to show that FTA can consistently outperform BOW. The
recognition accuracy could be further boosted by fusing multiple
features (e.g., by concatenation), but this is out of the scope of this
paper.

Next, we further study the performance with respect to different $k$
(i.e., the $(k+1)$-ary code used by FTA) as well as the threshold
$\theta$. We only report the results for FTA by peak since it is
consistently better than FTA by thresholding. Figure 3 shows the
accuracy versus $k$ when the code length is set to 1000. The accuracy
drops as $k$ increases, and $k = 2$ gives the best performance. This is
because a larger $k$ may produce redundant codes when comparing a large
number of postures: the comparison may be dominated by a few more
salient postures, resulting in a loss of discriminative information.

Features   BOW            FTA by Thresholding   FTA by Peak
PJD        76.97 ± 1.63   84.44 ± 2.29          86.57 ± 2.52
JO         72.02 ± 2.65   71.01 ± 2.23          73.43 ± 2.07
PA         78.28 ± 2.98   85.76 ± 1.81          90.20 ± 1.43
HVC        59.69 ± 1.46   66.06 ± 1.18          69.60 ± 1.93

Table 1: Performance comparison between different hashing methods on
the UTKinect-Action dataset ($k = 2$, $p = 1000$, accuracy in terms of
%).

Figure 3: Relationship between $k$ and the accuracy on different
features on the UTKinect-Action dataset (the $(k+1)$-ary code length is
set to 1000).

Figure 4 shows the effect of the code length on the performance when
$k = 2$. The accuracy increases as the code length grows, and remains
stable after the code length reaches 1,000. Note that the code length
is with respect to the $(k+1)$-ary code. In other words, we use
$\lceil \log_2 (k+1) \rceil$ binary bits to encode a $(k+1)$-ary code.

Figure 4: Relationship between the $(k+1)$-ary code length and the
accuracy on different features on the UTKinect-Action dataset ($k$ is
set to 2).

Figure 5: Relationship between the threshold and the accuracy on
different features on the UTKinect-Action dataset ($k$ is set to 2 and
the $(k+1)$-ary code length is set to 1000).

As shown, the entire video requires only 100 $(k+1)$-ary hash codes
(i.e., 200 bits when $k = 2$) to achieve an accuracy higher than 80%
with the PJD and PA features. If a video has 50 frames, this means we
need only 4 bits per frame, which is extremely efficient considering
that state-of-the-art supervised hashing methods for image retrieval
normally need 32 to 64 bits for a single image to achieve a comparable
performance [6]. This shows the significant efficiency that the
proposed FTA can achieve.
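The bit rate quoted above follows from a short calculation, written out here with $p = 100$ codes, $k = 2$ and $n = 50$ frames, where $n$ is the number of frames as in Section 3.1 and the logarithm is assumed to be base 2:

```latex
\frac{p \,\lceil \log_2 (k+1) \rceil}{n}
  = \frac{100 \times \lceil \log_2 3 \rceil}{50}
  = \frac{100 \times 2}{50}
  = 4 \ \text{bits per frame}
```

Since the numerator does not depend on $n$, the per-frame cost decreases further for longer videos hashed with the same $p$ and $k$.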

Finally, we also show the effect of the threshold on the performance of
FTA by peak. In this experiment, $k$ is fixed to 2 and the $(k+1)$-ary
code length is set to 1,000. In Figure 5, the accuracy versus the
threshold behaves consistently across all four features. When the
threshold is set to a small value, many false postures can pass the
threshold test. Conversely, when the threshold is set to a larger
value, true postures can fail the test, resulting in a hash code full
of "0" symbols. Both cases compromise the accuracy. Thus the threshold
should be set in a reasonable range to reach a satisfactory level of
accuracy by avoiding false positive and false negative detections of
latent postures.


MSR Action3D Dataset

The MSR Action3D dataset [7] covers 20 sports action types and 10
subjects. All subjects perform each action two or three times. The
dataset is very challenging due to the high intra-class variations. We
follow the same experiment settings as in [17].

Similar to the experiment on the UTKinect-Action dataset, we compare
the performance of the baseline BOW, FTA by thresholding and FTA by
peak. Results are reported in Table 2. Again, FTA by peak achieves the
best accuracy of the three methods across all features. Different
features produce different levels of accuracy: the JO feature achieves
an accuracy of 77.81% while the PJD feature reaches only 50.23%.

Features   BOW            FTA by Thresholding   FTA by Peak
PJD        44.79 ± 2.15   44.02 ± 1.60          50.23 ± 1.09
JO         50.69 ± 1.92   58.97 ± 1.38          77.81 ± 2.25
PA         49.58 ± 1.29   51.65 ± 1.05          55.94 ± 2.12
HVC        49.27 ± 2.86   50.54 ± 1.34          57.32 ± 2.27

Table 2: Performance comparison between different hashing methods on
the MSR Action3D dataset ($k = 2$, $p = 1000$, accuracy in terms of %).

We also evaluate the impact of $k$ and the code length on the accuracy
of the FTA by peak algorithm, and show the results in Figure 6 and
Figure 7. As shown, the recognition accuracy drops as $k$ increases,
while a longer code length usually produces higher accuracy. These
results are consistent with those on the UTKinect-Action dataset.


MSRActionPairs Dataset 

The MSRActionPairs dataset [11] consists of 12 action types performed
by 10 subjects. Each subject performs every action three times. This
dataset contains 6 pairs of similar actions which have exactly the same
poses but different temporal orders, for example, "Pick up" and "Put
down", or "Push a chair" and "Pull a chair". This dataset is therefore
very suitable for evaluating the temporal order-preserving capability
of the proposed FTA hashing algorithm. We follow the same test setting
as [11].

Features   BOW            FTA by Thresholding   FTA by Peak
PJD        50.74 ± 1.95   59.71 ± 2.28          70.74 ± 1.72
JO         50.40 ± 3.47   57.25 ± 2.73          53.6 ± 1.72
PA         57.37 ± 1.32   61.48 ± 1.18          71.48 ± 2.26
HVC        64.11 ± 3.48   75.20 ± 2.20          86.05 ± 2.61

Table 3: Performance comparison between different hashing methods on
the MSRActionPairs dataset (accuracy in terms of %).

Figure 6: Relationship between $k$ and the accuracy on different
features on the MSR Action3D dataset (the $(k+1)$-ary code length is
set to 1000).

Figure 7: Relationship between the $(k+1)$-ary code length and the
accuracy on the MSR Action3D dataset.

Table 3 compares the results of FTA by peak, FTA by thresholding and
the baseline BOW algorithm. FTA by peak outperforms the BOW algorithm
by about 15 to 20 percentage points in recognition accuracy on most of
the features. Since the MSRActionPairs dataset is very sensitive to the
temporal order of the action sequences, this shows that the FTA hashing
can effectively distinguish between different actions with similar
postures in different temporal orders. In contrast, BOW does not encode
the temporal structure, and is incapable of handling the challenging
setting of this dataset.

In addition, we perform the same set of experiments on the impact of
$k$ and the code length on the performance, and show the results in
Figure 8 and Figure 9. The results are similar to those observed on the
other two datasets.



Figure 8: Relationship between $k$ and the accuracy on the
MSRActionPairs dataset.

Figure 9: Relationship between the $(k+1)$-ary code length and the
accuracy on the MSRActionPairs dataset.

5. Conclusions 

In this paper, we revisit the human action recognition problem from a
hashing perspective and propose a novel First-Take-All hashing
algorithm to interpret the temporal patterns of the entire video. The
FTA hashing preserves the temporal order of the action sequence and
achieves invariance to temporal translation, motion scale and execution
rate. Experimental results on three public 3D human action datasets
have demonstrated the performance and the efficiency of the proposed
FTA hashing.

The current work is based on random projection. In future work, we
would like to further enhance the performance of the FTA hashing and
shrink the code length by leveraging learning-based methods.